fix(routing): prevent QueryEvent publish races#3490
Open
Conversation
Clone Responses (and each AddrInfo.Addrs) before delivery so a publisher can safely mutate its slice after the call returns.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug
PublishQueryEventforwardsQueryEvent.Responsesto subscribers by pointer. A publisher that keeps mutating its[]*peer.AddrInfo(or anyAddrInfo.Addrs) after the call returns races with every subscriber reading the event.This is the canonical shape in routing systems, for example DHT implementations: a worker aggregates "closer peers" into a slice, publishes it as a PeerResponse
event, then continues processing the same slice (e.g. enrichingAddrs` from a peerstore).This class of race was reported against in ipfs/kubo#11287 and ipfs/kubo#11116.
Proof
To make review easier, the first commit (
test: PublishQueryEvent Responses race) adds a minimal test that fails under-race. CI caught it ongo-testbefore the fix:https://github.com/libp2p/go-libp2p/actions/runs/24641378021/job/72045816212?pr=3490#step:19:580
The fix
Note
This is a general fix, not a kad-dht-specific one. Any current or future routing/DHT implementation that emits
QueryEvents can hit the same race; centralizing the copy incore/routingcovers all of them at once. I feel this is the right call to avoid similar bugs surfacing every few months when someone moves something around, or if golang changes something to make race more probable.PublishQueryEventnow deep-copiesResponses(and eachAddrInfo.Addrs) before handing the event to subscribers. Callers get a strengthened contract: mutate your own copy freely after publishing.ctx.Valueearly return is untouched)Multiaddrvalues stay shared; they are by-convention immutable